ABSTRACT

Online controlled experiments, now commonly known as
A/B testing, are crucial to causal inference and data-driven
decision making in many internet-based businesses. While a
simple comparison between a treatment (the feature under
test) and a control (often the current standard) provides a
starting point for identifying the cause of a change in a Key
Performance Indicator (KPI), it is often insufficient, as the
change we wish to detect may be small, and inherent variation
in the data may obscure movements in the KPI. To have
sufficient power to detect statistically significant changes in
the KPI, an experiment needs to engage a sufficiently large
proportion of traffic to the site, and also last for a sufficiently
long duration. This limits the number of candidate variations
that can be evaluated and the speed of new feature iterations.
We introduce more sophisticated experimental designs,
specifically the repeated measures design, including the
crossover design and related variants, to increase KPI
sensitivity with the same traffic size and experiment duration.
In this paper we present FORME (Flexible Online Repeated
Measures Experiment), a flexible and scalable framework for
these designs. We evaluate the theoretical basis, design
considerations, practical guidelines and big data implementation.
We compare FORME to an existing methodology, the mixed
effect model, and demonstrate why FORME is more flexible
and scalable. We present empirical results based on both
simulation and real data. Our method is widely applicable to
online experimentation to improve sensitivity in detecting
movements in KPI and to increase experimentation capability.

Categories and Subject Descriptors 

G.3 [Probability and Statistics]: Experiment Design

General Terms 

Measurement, Experimentation Design, Web search, A/B 
Testing 


1. INTRODUCTION 

Many recent publications attest to the power of online
A/B testing as the gold standard for making causal inference
in web-facing companies large and small. By randomly
assigning a feature to otherwise balanced groups of users and
measuring subsequent changes in user behavior, A/B testing
isolates the effect of the feature change, i.e. the treatment
effect, from extraneous sources of variance.

To perform statistical inference in both point estimation and
hypothesis testing for the treatment effect, while controlling
type I error at a pre-specified level, we desire lower type
II error, or equivalently, higher-powered experiments. That
is, we wish to be able to detect the effect when there is one.
Running underpowered experiments has many perils. Not
only would we miss potentially beneficial effects, we may also
get false confidence about the lack of negative effects.
Statistical power increases with larger effect size and smaller
variance. Let us look at these aspects in turn.

While the actual effect size from a potential new feature may
not be known, we generally select a size that makes business
sense, i.e. one that justifies the cost of feature development
and ongoing maintenance of the code base. Dramatic features
that drastically alter user behavior and are reflected
in KPI as large effect sizes are few and far between. Often
the candidate feature has but a small effect on the KPI.
Nonetheless, by accumulating a portfolio of small changes,
a business can achieve big success. To quote
Rule #2 of Kohavi et al. (2014), winning is done inch by
inch. This is especially true for mature web-facing businesses
where most of the low-hanging fruit has already been picked.

In general one expects the variance of an estimate to decrease
with increased sample size. But this is not always true. For an
online business, at first glance the number of visitors may
seem large, and a casual observer may conclude that the power
to detect any change is high. In reality, however, intrinsic
variation between users is large and may obscure small
movements in KPI. Variation in the measured treatment effect
comes from various sources. Sources exogenous to the treatment
itself include user-to-user variation, e.g. users on
slower internet connections will always have slower page
load times regardless of which experiments are run. Variance
for some metrics does not decrease over time; instead it
plateaus after some period of time (say, two weeks), and
running longer experiments no longer yields corresponding
benefits (see Kohavi et al. (2012), Section 3.4). This poses
a limitation to any online experimentation platform, where
fast iterations and testing many ideas can reap the most
rewards.


1.1 Motivation 

To improve sensitivity of measurement, apart from accurate
implementation and increasing sample size and duration, we
can employ statistical methods to reduce variance. Using
the user’s pre-experiment behavior as a baseline for his/her
post-experiment behavior, we can reduce the variance in
the measured treatment effect. The experiment setup in a
two-week experiment is shown in Table 1. The typical A/B test
is illustrated in the first row. In the past we have used
regression to reduce variance (CUPED: Controlled Experiments
Using Pre-Experiment Data, see Deng et al. (2013))
and have achieved good results, e.g. reducing variance in
number of queries per unique user in a given time period by
40-50%. CUPED has the benefit of having readily available
baseline data “for free”. This improvement is achieved with
the existing design, using the “free” data as covariates only in
the analysis stage. CUPED is in fact a form of repeated measures
design, where multiple measures on the same subjects
are taken over time. In particular, in the pre-experiment
stage, all users received the default feature C (control) and
none received the new feature T (treatment).

Table 1: Repeated Measures Designs


In this paper we extend the idea further by employing the
repeated measures design in different stages of treatment
assignment. The traditional A/B test can be analyzed using
the repeated measures analysis, reporting a “per week”
treatment effect, as shown in row 3, the “parallel” design, in
Table 1. The two-week experiment can be considered to be
conducted in two periods, even though users received the
same treatment assignment during both periods. In one
of the new designs, the “crossover” design, in contrast, we
swap treatment assignment halfway through the experiment
(row 4 in Table 1). Each user will be exposed to both versions
of the treatments, instead of only one of the two as in
the usual A/B testing scenario. In sequence, a user will receive
either T followed by C, or C followed by T, with the
flight re-assignment happening at the same moment for all
users. Instead of randomizing treatments to users, we randomize
treatment sequences (TC or CT) to users. This way
each user serves as his/her own control in the measurement.
In fact, the crossover design is a type of repeated measures
design commonly used in biomedical research to control for
within-subject variation. We also discuss practical considerations
for the repeated measures design, with variants of the
crossover design to study the carry over effect, including the
“re-randomized” design (row 5 in Table 1).

1.2 Main Contributions 

In this paper, we propose a framework called FORME (Flexible
Online Repeated Measures Experiment). We make contributions
in both novel application and new methodology.


Novel applications. We propose different experiment
designs with repeated measurement. We demonstrate
through real examples the value of these new designs
compared to the traditional A/B test. Methods for checking
model assumptions are also presented. We also compare
different designs for practical use and propose a
general workflow for practitioners.

New Methodology. We review standard repeated measures
models in the framework of mixed effect models.
We present a new method to fit the model that
is scalable to big data. Our method is flexible in
the sense that it makes far fewer assumptions than the
traditional method based on the mixed effect model (Bates
et al. 2012a). It naturally handles missing data (common in
online experimentation) without a missing at random
assumption and still provides unbiased average
treatment effect estimation when the mixed effect model
fails. FORME can fit different types of repeated measures
models under the same framework. It can also
be applied to metrics beyond those defined as a simple
average, such as metrics defined as a function of other
metrics.


2. ILLUSTRATION OF FORME 

In this section we take a close look at several designs,
with a treatment and a control, and with experiments carried
out over several periods. Many common online metrics
display different patterns between weekdays and weekends.
Therefore experiments at Bing, and at many large IT companies
in general, are run for at least a full week to account
for the difference between weekdays and weekends. In the
following sections we assume the minimum experimentation
“period” to be one full week, which may extend up to two
weeks. To facilitate our illustration, in all the derivations
in this section we assume all users appear in all periods,
i.e. no missing measurements. We also restrict ourselves
to metrics that are defined as a simple average and assume
treatment and control have the same sample size. We further
assume the treatment effect for each subject is fixed.
We emphasize that this is just for illustration purposes; our
method does not rely on these assumptions, and we describe
how we handle missing data and more complicated metrics
in Section 4. Impatient readers who are familiar with repeated
measures analysis may jump to Section 5 to
see details of FORME’s model assumptions and a comparison
to the linear mixed effect model.

Denote the metric value mean in the treatment group as
$\mu_T$, and that in control as $\mu_C$. We are interested in the
average treatment effect (ATE) $\delta = \mu_T - \mu_C$, which is a
fixed effect in the model in this section. This way, the various
designs considered can be examined in the same framework
and easily compared.















We will proceed to show, with theoretical derivations, that
given the same total traffic:

• Variance using CUPED < variance using the t-test

• With CUPED: variance in the parallel design < variance in
the cumulative design

• Variance in the crossover design < variance in the parallel design

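Denote the observed sample values in the treatment groups and
time periods as $X$, and their means as $\beta$. Note that $X$ is a
vector of metric values $X_i$ for different time periods indexed by
$i$, and the treatment effect $\delta$ can be formulated as a function
of $\beta$ depending on the model specification. Under the central
limit theorem (CLT), with sufficiently large samples $X$ is
asymptotically normal

$$X \sim N(\beta, \Sigma).$$

The likelihood of $\beta$ given the observed data is then

$$L = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (X - \beta)^T \Sigma^{-1} (X - \beta)\right),$$

where $k$ is the length of $X$. To get the maximum likelihood
estimate (MLE) of $\beta$, denoted by $\hat{\beta}$, we seek to
minimize $-2\log(\text{Likelihood})$

$$l = (X - \beta)^T \Sigma^{-1} (X - \beta) + \text{const}.$$

Solving $\frac{\partial l}{\partial \beta} = 0$ gives the MLE of $\beta$. Its
variance-covariance matrix is

$$\mathrm{Var}(\hat{\beta}) = I(\beta)^{-1},$$

where the Fisher information is

$$I(\beta) = E\left[\left(\frac{\partial}{\partial \beta} \log f(X|\beta)\right)^T \left(\frac{\partial}{\partial \beta} \log f(X|\beta)\right) \,\Big|\, \beta\right].$$

In the following sections we will explicitly model the mean
$\beta$ as a function of other parameters, $\beta(\lambda)$, one of whose
components is the treatment effect $\delta$, and study the expected
variance of the MLEs of $\lambda$. In fact this is simply:

$$\mathrm{Var}(\hat{\lambda}) = I(\lambda)^{-1} = \left[\left(\frac{\partial \beta}{\partial \lambda}\right)^T \Sigma^{-1} E\left[(X - \beta)(X - \beta)^T\right] \Sigma^{-1} \frac{\partial \beta}{\partial \lambda}\right]^{-1}.$$

Since $E[(X - \beta)(X - \beta)^T] = \Sigma$, the bracket reduces to
$\left(\frac{\partial \beta}{\partial \lambda}\right)^T \Sigma^{-1} \frac{\partial \beta}{\partial \lambda}$.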

The coefficient of variation (CV), defined as the ratio of the
standard deviation of a metric to its mean, determines the
sensitivity or the power of the experiment, given the same
sample size. To study the sensitivity or power of various
experimental designs, once we have established that the effect
size remains relatively stable across different measuring
periods, we can focus solely on the variation of the estimated
effect size. Specifically, the diagonal cell in
$\mathrm{Var}(\hat{\lambda})$ corresponding to the treatment effect $\delta$
gives its variance, and is our main focus in the rest of
this section.
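
When the mean vector is linear in the parameters, $\beta(\lambda) = A\lambda$ for a known design matrix $A$, the MLE and variance above reduce to generalized least squares: $\hat{\lambda} = (A^T \Sigma^{-1} A)^{-1} A^T \Sigma^{-1} X$ with variance $(A^T \Sigma^{-1} A)^{-1}$. The following is a minimal sketch in pure Python, not the implementation used in this paper. The helper names (`transpose`, `matmul`, `inverse`, `gls`) and the example numbers are illustrative assumptions, and $\Sigma$ is treated as known.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    # Augment the matrix with the identity matrix.
    aug = [
        [float(v) for v in row] + [float(i == j) for j in range(n)]
        for i, row in enumerate(m)
    ]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def gls(design, sigma, x):
    """MLE of lambda and its covariance when x ~ N(design @ lambda, sigma)."""
    sigma_inv = inverse(sigma)
    at_sinv = matmul(transpose(design), sigma_inv)
    cov = inverse(matmul(at_sinv, design))
    estimate = matmul(cov, matmul(at_sinv, [[v] for v in x]))
    return [row[0] for row in estimate], cov


# Hypothetical example: observations ordered as (control pre, control post,
# treatment pre, treatment post); parameter columns are (mu, delta, theta).
s2, rho = 1.0, 0.5
sigma = [
    [s2, rho * s2, 0.0, 0.0],
    [rho * s2, s2, 0.0, 0.0],
    [0.0, 0.0, s2, rho * s2],
    [0.0, 0.0, rho * s2, s2],
]
design = [[1, 0, 0], [1, 0, 1], [1, 0, 0], [1, 1, 1]]
estimate, cov = gls(design, sigma, [10.0, 10.5, 10.1, 11.0])
print("delta estimate:", estimate[1], "variance:", cov[1][1])
```

In this sketch each design in the section corresponds to a different `design` matrix and `sigma`, and the variance of the treatment effect is the diagonal entry of `cov` in the position of $\delta$.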

Analysis from randomized two-group experiments employs 
the two sample t-test under the usual A/B testing scenario. 
As a gentle introduction we will first look at the t-test using 
this notation. 


2.1 Two Sample T-test 

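Let $X$ denote the observed average metric value in the control
group and $Y$ denote that in the treatment group. Since
users are randomly assigned to either the treatment or the
control group, $X$ and $Y$ are independent. For simplicity of
notation, we assume the variances in the two groups to be equal.
Given a large enough sample size, under the CLT, and plugging
in the observed sample variances, we have:

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left(\begin{pmatrix} \mu \\ \mu + \delta \end{pmatrix}, \begin{pmatrix} s_X^2 & 0 \\ 0 & s_Y^2 \end{pmatrix}\right)$$

where $\mu$ is the mean metric value in the control group, $\delta$ is the
treatment effect compared to the control group, and $s_X^2$ and
$s_Y^2$ are the variances of $X$ and $Y$ respectively. Here
$\lambda = (\mu, \delta)$, and the $-2\log$-likelihood, denoted by $l$, of the
parameter vector $\beta(\lambda) = (\mu, \mu + \delta)^T$ given the observed
data is then

$$l = \begin{pmatrix} X - \mu & Y - \delta - \mu \end{pmatrix} \begin{pmatrix} s_X^2 & 0 \\ 0 & s_Y^2 \end{pmatrix}^{-1} \begin{pmatrix} X - \mu \\ Y - \delta - \mu \end{pmatrix} + \text{const}.$$

Solving for the MLE of $\delta$, we obtain its variance as

$$\mathrm{Var}(\hat{\delta})|_{\text{T-test}} = s_X^2 + s_Y^2 \qquad (1)$$

It is simply the sum of the variances from the treatment and
control groups, which is the asymptotic variance of $Y - X$.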
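
As a small illustration of equation (1) on raw data, the sketch below computes the estimated treatment effect and its variance from per-user metric values. The function name `t_test_delta` and the input lists are hypothetical; the variance of each group mean is taken as the sample variance divided by the group size, which is the plug-in estimate assumed here.

```python
import math
import statistics


def t_test_delta(control, treatment):
    """Return (delta_hat, Var(delta_hat), z statistic) for two independent groups."""
    mean_c = statistics.fmean(control)
    mean_t = statistics.fmean(treatment)
    # Variance of each group mean: sample variance divided by group size.
    var_mean_c = statistics.variance(control) / len(control)
    var_mean_t = statistics.variance(treatment) / len(treatment)
    delta_hat = mean_t - mean_c
    # Equation (1): the two group means are independent, so variances add.
    var_delta = var_mean_c + var_mean_t
    return delta_hat, var_delta, delta_hat / math.sqrt(var_delta)


delta_hat, var_delta, z = t_test_delta([1, 2, 3, 4], [2, 3, 4, 5])
print(delta_hat, var_delta, z)
```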

2.2 Use Pre-experiment Data for Variance Reduction

At the analysis level, different models seek to explain the
amount of variation in observed data, which may come from
intrinsic, within-user difference, as well as variation introduced
by differential treatment. For example, users that
connect through broadband tend to have faster page load
times than people using a dial-up connection. This difference
exists regardless of which treatment conditions the users are
exposed to, and is thus irrelevant when measuring the difference
introduced by different treatments. As a result, the
measurements on the same users over time tend to be positively
correlated.

CUPED and previous work have established that by including
covariates that are unrelated to the treatment, we can
improve sensitivity and reduce the variance of the estimated
treatment effect. Specifically, the users’ pre-experiment
behaviors serve as a good baseline for their behavior during the
experiment. By including pre-experiment data as a covariate
in the regression model for the treatment effect, we can reduce
the variance of the estimated treatment effect.


Denote the pre-experiment average metric values as $X_0$
and $Y_0$ for the later control and treatment groups respectively.
By the CLT

$$\begin{pmatrix} X_0 \\ X \\ Y_0 \\ Y \end{pmatrix} \sim N\left(\begin{pmatrix} \mu \\ \mu + \theta \\ \mu \\ \mu + \theta + \delta \end{pmatrix}, \begin{pmatrix} s^2 & \rho s^2 & 0 & 0 \\ \rho s^2 & s^2 & 0 & 0 \\ 0 & 0 & s^2 & \rho s^2 \\ 0 & 0 & \rho s^2 & s^2 \end{pmatrix}\right)$$

where $\theta$ is the difference between the pre-experiment and
experiment periods, i.e. the longitudinal effect, and $\rho$ is the
correlation between the two periods. Here $\lambda = (\mu, \delta, \theta)$. We
assume the correlation $\rho$ in both treatment and control groups
to be the same for simplicity. Results do not depend
on this assumption. Even though the two treatment groups
are still independent, metric values measured on the same
group of users across different time periods are in general
correlated. As we will see later, this correlation effectively
reduces the variance of $\hat{\delta}$. Similarly, we can solve for the MLEs
by setting the partial derivatives of $l$ to 0 and derive variances
for these estimates.

$$\mathrm{Var}(\hat{\delta})|_{\text{CUPED}} = 2s^2(1 - \rho^2) \qquad (2)$$

It is easy to see that (2) has a smaller variance of $\delta$ than (1)
(with $s_X^2 = s_Y^2 = s^2$) by an amount of $2\rho^2 s^2$. As users’
behavior is usually consistent across time, i.e. with non-zero
correlation $\rho$ among different time periods, this amount is
positive. The variance removed is a fraction $\rho^2$ of the
original variance.
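
The $\rho^2$ reduction can be checked on data with the regression-adjustment form of CUPED described in Deng et al. (2013), which is a different route to the same quantity than the MLE formulation used here. The sketch below is illustrative only: `covariance` and `cuped_adjust` are hypothetical helper names, the data are made up, and the adjustment coefficient is estimated from a single list of users rather than pooled across treatment and control as one would do in practice.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def cuped_adjust(pre, post):
    """Subtract the part of `post` that is linearly explained by `pre`."""
    theta = covariance(pre, post) / statistics.variance(pre)
    pre_mean = statistics.fmean(pre)
    # Centering `pre` keeps the mean of the adjusted metric unchanged.
    return [y - theta * (x - pre_mean) for x, y in zip(pre, post)]


pre = [1.0, 2.0, 3.0, 4.0, 5.0]
post = [2.0, 1.0, 4.0, 3.0, 6.0]
adjusted = cuped_adjust(pre, post)

rho_sq = covariance(pre, post) ** 2 / (
    statistics.variance(pre) * statistics.variance(post)
)
# The adjusted variance equals the original variance times (1 - rho^2).
print(statistics.variance(post), statistics.variance(adjusted), 1 - rho_sq)
```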

2.3 Cumulative vs. Parallel Design 

Note that in the previous design we make no assumption on
the duration of the pre-experiment and experiment periods.
Empirical studies in Deng et al. (2013) have shown that
using one week of pre-experiment data provides a similar amount
of variance reduction as using even longer durations. For
simplicity, in practice we recommend using one week of such
data.

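We have also mentioned that, to capture the difference between
weekdays and weekends, we recommend running experiments
for whole weeks, typically 14 days. Assuming the treatment
effect is the same across time, this gives us two ways
of reporting treatment effects, i.e. reporting the cumulative
effect for the whole 14 days, and reporting the weekly treatment
effect as a weighted average between observed values in the
two weeks. For the latter, using the same notation as above,
with subscripts 1 and 2 denoting the two weeks, we have

$$\begin{pmatrix} X_1 \\ X_2 \\ Y_1 \\ Y_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu \\ \mu + \theta \\ \mu + \delta \\ \mu + \theta + \delta \end{pmatrix}, \begin{pmatrix} s_1^2 & \rho s_1 s_2 & 0 & 0 \\ \rho s_1 s_2 & s_2^2 & 0 & 0 \\ 0 & 0 & s_1^2 & \rho s_1 s_2 \\ 0 & 0 & \rho s_1 s_2 & s_2^2 \end{pmatrix}\right)$$

We can solve for the MLEs and their variances.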


$$\mathrm{Var}(\hat{\delta})|_{\text{parallel}} = \frac{2 s_1^2 s_2^2 (1 - \rho^2)}{s_1^2 + s_2^2 - 2\rho s_1 s_2} \qquad (3)$$


For the former, if the metric value is strictly additive across
time, an example being revenue, under our toy model where
all users appear in both periods, the cumulative treatment
effect would be $\delta' = 2\delta$, with each group's two-week total
having variance $\mathrm{Var}(X_1 + X_2)$.

Using (1), the variance of the MLE is

$$\mathrm{Var}(\hat{\delta}')|_{\text{cumulative}} = 2\,\mathrm{Var}(X_1 + X_2) = 2(s_1^2 + s_2^2 + 2\rho s_1 s_2) \qquad (4)$$

Comparing the coefficient of variation (CV) in (4) to that in (3),

$$\frac{\mathrm{Var}(\hat{\delta}')}{4\delta^2} - \frac{\mathrm{Var}(\hat{\delta})}{\delta^2} = \frac{(s_1 + s_2)^2 (s_1 - s_2)^2}{2\delta^2 (s_1^2 + s_2^2 - 2\rho s_1 s_2)} \geq 0$$

Equality holds when the two periods have identical variance,
i.e. $s_1 = s_2$. In other words, for additive metrics whose
variance differs substantially between weeks, reporting weekly
metrics alone will improve metric sensitivity.
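
The comparison between (3) and (4) is easy to check numerically. The sketch below simply evaluates the two closed forms; the function names and the example values of $s_1$, $s_2$, $\rho$ and $\delta$ are illustrative assumptions.

```python
def var_parallel(s1, s2, rho):
    """Equation (3): variance of the weekly effect in the parallel design.

    Undefined when s1 == s2 and rho == 1 (the denominator is zero).
    """
    return 2 * s1**2 * s2**2 * (1 - rho**2) / (s1**2 + s2**2 - 2 * rho * s1 * s2)


def var_cumulative(s1, s2, rho):
    """Equation (4): variance of the two-week cumulative effect."""
    return 2 * (s1**2 + s2**2 + 2 * rho * s1 * s2)


def squared_cv_gap(s1, s2, rho, delta):
    """Cumulative minus parallel, each variance scaled by its squared effect."""
    cumulative = var_cumulative(s1, s2, rho) / (2 * delta) ** 2
    parallel = var_parallel(s1, s2, rho) / delta**2
    return cumulative - parallel


# Equal weekly variances: the two reports are equally sensitive.
print(squared_cv_gap(1.0, 1.0, 0.5, 1.0))
# Unequal weekly variances: the weekly (parallel) report is more sensitive.
print(squared_cv_gap(1.0, 2.0, 0.5, 1.0))
```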


For non-additive metrics, such as ratio metrics like Click
Through Rate (CTR), the derivation becomes more involved.

In practice, there are also many non-recurring users. We
therefore opt to show empirical results instead, in the results
section.

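Careful readers may have noticed that this method makes a key
assumption: the treatment effect $\delta$ remains the same in the
two weeks. To check this assumption, we can explicitly test
for $\delta_1 = \delta_2$ by fitting the model this way:

$$\begin{pmatrix} X_1 \\ X_2 \\ Y_1 \\ Y_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu \\ \mu + \theta \\ \mu + \delta_1 \\ \mu + \theta + \delta_2 \end{pmatrix}, \begin{pmatrix} s_1^2 & \rho s_1 s_2 & 0 & 0 \\ \rho s_1 s_2 & s_2^2 & 0 & 0 \\ 0 & 0 & s_1^2 & \rho s_1 s_2 \\ 0 & 0 & \rho s_1 s_2 & s_2^2 \end{pmatrix}\right)$$

and test for the equivalence of the MLEs, $H_0: \delta_1 = \delta_2$. The
parallel design is appropriate if we fail to reject $H_0$.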

2.4 Crossover Design 

With this background in place, we now look at the variance
reduction achieved through the crossover design.

The crossover design employs a similar idea to CUPED. Instead
of using pre-experiment data as the baseline, in crossover
experiments each user is exposed to both treatments
sequentially, while the order of treatments is determined
randomly. Each user’s behavior while he or she is on
the control condition serves as a baseline for his or her behavior
on the treatment condition. By accounting for within-user
variation, analysis based on the crossover design also
reduces variance for the estimated treatment effect.


In causal inference, we often seek to eliminate any confounding
factors and isolate the root cause of an observed difference.
Because the counterfactual cannot be observed in the potential
outcome framework (Rosenbaum and Rubin 1983; Morgan and
Winship 2007), randomization is used to make the control
group a surrogate for the counterfactual. This surrogate
only works on average. In reality some imbalance in
observed or unobserved factors will often remain. The crossover
design uses each test subject as his or her own control,
thus reducing the influence of confounding covariates and
achieving better sensitivity in estimating the treatment effect.


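The distribution of the observed sample averages, with $X$ denoting
the group that receives C followed by T and $Y$ the group that
receives T followed by C, is:

$$\begin{pmatrix} X_1 \\ X_2 \\ Y_1 \\ Y_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu \\ \mu + \theta + \delta \\ \mu + \delta \\ \mu + \theta \end{pmatrix}, \begin{pmatrix} s_1^2 & \rho s_1 s_2 & 0 & 0 \\ \rho s_1 s_2 & s_2^2 & 0 & 0 \\ 0 & 0 & s_1^2 & \rho s_1 s_2 \\ 0 & 0 & \rho s_1 s_2 & s_2^2 \end{pmatrix}\right)$$

Similarly, the treatment effect estimate has variance

$$\mathrm{Var}(\hat{\delta})|_{\text{crossover}} = \frac{2 s_1^2 s_2^2 (1 - \rho^2)}{s_1^2 + s_2^2 + 2\rho s_1 s_2} \qquad (5)$$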


Comparing (5) to (3), it is obvious that in the crossover
design, the treatment effect has smaller variance as long as the
correlation $\rho$ is positive. Similar to CUPED, the amount of
sensitivity improvement is determined by the size of $\rho$. The
larger the correlation between time periods, the more
improvement the crossover design has over the parallel design.
The equivalence of the treatment effect can be checked in the
same way as in Section 2.3.
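
The role of the sign of $\rho$ can be made explicit with a short derivation. The sketch below is a reconstruction under stated assumptions rather than the authors' own algebra: it assumes the group $X$ receives C then T, the group $Y$ receives T then C, the two groups are independent, and period $i$ has standard deviation $s_i$ with within-user correlation $\rho$.

```latex
% Assumed means: E[X_1] = \mu, E[X_2] = \mu + \theta + \delta,
%                E[Y_1] = \mu + \delta, E[Y_2] = \mu + \theta.
% Two unbiased contrasts for \delta, one per period:
D_1 = Y_1 - X_1, \qquad D_2 = X_2 - Y_2
% Their second moments (X and Y groups are independent):
\mathrm{Var}(D_1) = 2 s_1^2, \quad
\mathrm{Var}(D_2) = 2 s_2^2, \quad
\mathrm{Cov}(D_1, D_2) = -\mathrm{Cov}(Y_1, Y_2) - \mathrm{Cov}(X_1, X_2) = -2 \rho s_1 s_2
% Every unbiased linear estimator is w D_1 + (1 - w) D_2; the minimum variance is
\mathrm{Var}(\hat{\delta})
  = \frac{\mathrm{Var}(D_1)\,\mathrm{Var}(D_2) - \mathrm{Cov}(D_1, D_2)^2}
         {\mathrm{Var}(D_1) + \mathrm{Var}(D_2) - 2\,\mathrm{Cov}(D_1, D_2)}
  = \frac{2 s_1^2 s_2^2 (1 - \rho^2)}{s_1^2 + s_2^2 + 2 \rho s_1 s_2}
% In the parallel design D_2 = Y_2 - X_2, so Cov(D_1, D_2) = +2 \rho s_1 s_2
% and the denominator becomes s_1^2 + s_2^2 - 2 \rho s_1 s_2.
```

With $s_1 = s_2 = s$ these expressions reduce to $s^2(1 - \rho)$ for the crossover design and $s^2(1 + \rho)$ for the parallel design, which shows directly that the crossover design has the smaller variance exactly when $\rho > 0$.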


2.5 Absolute or Relative Change? 

So far in this paper we have considered the absolute treatment
difference $\delta = \mu_T - \mu_C$. In practice we measure thousands of
metrics simultaneously. These metrics may have vastly different
magnitudes in their treatment effects. Even the same
metric measured over different durations, or over different
sample sizes, may have different absolute $\delta$'s. This renders
comparison of effect sizes across different experiments difficult.
To overcome this difficulty, we often seek to measure the
percent delta, $\%\delta = \frac{\delta}{\mu_C} \cdot 100\%$. The relative change is less
influenced by the base difference and is a more robust measure
of treatment effect. In online experimentation we usually
deal with hundreds of thousands of samples, therefore the
CLT still holds and the relative change is still asymptotically
normal. The additive model described above can be
readily adapted to model relative difference instead of absolute
difference, by formulating the expected group means in
the mixed variance-covariance structure model. For example,
the crossover model with relative treatment effect can
be written as:

Theoretical derivation to show variance reduction can be complex,
but the MLEs and their variances can be easily
solved using numerical methods.
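
For the simplest case of two independent groups, the variance of the percent delta can also be approximated in closed form. The sketch below swaps in a first-order delta-method approximation instead of the numerical MLE mentioned above, and it is not the crossover model. The function name `percent_delta` and the example numbers are hypothetical.

```python
def percent_delta(mean_c, var_mean_c, mean_t, var_mean_t):
    """Percent delta and its approximate variance for independent group means.

    var_mean_c and var_mean_t are variances of the group means themselves
    (sample variance divided by sample size).
    """
    ratio = mean_t / mean_c
    pct = (ratio - 1.0) * 100.0
    # First-order expansion of mean_t / mean_c around the true means.
    var_ratio = (var_mean_t + ratio**2 * var_mean_c) / mean_c**2
    return pct, var_ratio * 100.0**2


pct, var_pct = percent_delta(10.0, 0.04, 11.0, 0.04)
print(pct, var_pct)
```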


2.6 The Unified Theme 

We have illustrated different model designs of FORME. Careful
readers may have already noticed that the unified theme here is
to study the joint distribution of $X$ and $Y$, which by the central
limit theorem is known to be multivariate normal. Each
model specification maps to the mean vector $\beta$ of this
multivariate normal. Therefore for any mean vector based on a
model specification, we can solve for the MLE and estimate its
variance using the Fisher information. The difficulty, however,
lies in how to estimate the covariance matrix in the general case
in the presence of missing data, and in general for metrics that
are not defined as a simple average. For the crossover design,
in particular, we also need a way to decide whether we can
safely assume the treatment effect in both periods is the
same without any carry over effect. We will address these
in detail in Section 3 and Section 4. Section 5 explains why
FORME is more flexible and scalable than the existing method
of fitting a linear mixed effect model.


3. CARRY OVER EFFECT 

The crossover design is not without concerns. An important
assumption in the crossover model is that the treatment effect
remains the same across the experimental periods. Since test
subjects randomly receive all combinations of treatments in
sequence, different users will receive different sequences, or
“orders”, of the treatments. It is possible that the order in which
users are exposed to treatments may change the effect. As
an extreme example, suppose our treatment introduced a
bug that results in a severely negative user experience, and
this group of users fails to revisit the website in the later
crossover period; the treatment effect is then different in the
two periods. We call this the carry over effect, as the users
exposed to the treatment first and the control later may
behave differently from the other group.

In some experiments where the treatment condition is less
noticeable to the users, the expected treatment effect is
small, and based on historical insight, it is safe to assume
no carry over effect exists. Usually a “wash-out” period can
be inserted between treatment periods.

3.1 Wash-out Period 

This approach calls for a “wash-out” period after the end
of the first period, where all users will receive the control.
Data from the wash-out period can be analyzed similarly
in a linear mixed model to estimate the carry over effect and
subsequently inform the design of the later stage. We leave this
as an exercise to the reader.


3.2 Estimate Carry over Effect 

In the crossover design where only two groups are allocated,
it is not hard to see that a potential carry over effect is
confounded with the week-to-week difference of the treatment
effect. Using only the crossover model, we can only measure
one of these two effects. As an alternative, at the cost of
less efficiency gain over the traditional non-crossover design,
we may estimate the carry over effect explicitly using the
following 4-group design.


In a 4-group re-randomized design, we conduct the experiment
over two periods, and split the users into four equally
sized groups: one receiving control in both periods, one
receiving treatment in both, one receiving treatment followed
by control, and the last receiving control followed by
treatment. We can then tease apart the carry over effect and
the treatment effect. We call this the re-randomized design, as
it is equivalent to having another round of user randomization
between the first and the second period. Using notation
from the linear mixed model, the model considering a potential
carry over effect $\alpha$ is then

The carry over effect is in the group that first received
treatment and then reverted back to control in the second stage.
Using the observed data we can estimate the carry over effect $\alpha$ as
an additional term in $\beta$. When $\alpha$ is not statistically significant
under a pre-specified type I error cutoff (usually 0.05), it
is safe to drop the term $\alpha$ from $\beta$ and re-fit the model. This
way, we gain one more degree of freedom and thus reduce the
variances of the MLEs.
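
The decision to keep or drop the carry over term can be sketched as a two-sided Wald test on its estimate. This is an illustrative reading of the paragraph above, not code from the paper: the function name `carry_over_test` is hypothetical, and `alpha_hat` and `var_alpha` stand for the MLE of the carry over effect and the corresponding diagonal entry of its estimated covariance matrix.

```python
import math


def carry_over_test(alpha_hat, var_alpha, level=0.05):
    """Two-sided z-test of H0: carry over effect is zero.

    Returns (p_value, significant). If `significant` is False, the carry
    over term can be dropped and the model re-fitted.
    """
    z = alpha_hat / math.sqrt(var_alpha)
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_value, p_value < level


print(carry_over_test(0.3, 0.04))
print(carry_over_test(0.1, 0.04))
```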


This approach enables direct estimation of carry over effect. 
It can also be considered a hybrid approach between the 
crossover and parallel design, as half of the users received 
crossed treatments, and half received the same treatments 
in the two periods. It is not hard to see this design will 
achieve sensitivity improvement between the crossover and 
parallel designs. 


4. MISSING VALUES AND METRICS BEYOND AVERAGE

4.1 Loss of follow-up and intent to treat 
















Loss of follow-up is a common term in clinical studies. It
refers to patients who were active participants during some
period of the trial, but stopped participating at some point
of follow-up. This can lead to incomplete study results and
potential bias when the attrition is not random.
Intention-to-treat analysis is commonly employed, where the
subjects’ initially assigned treatment is used regardless of the
treatment actually received. In online A/B testing the idea is
similar; users are assigned to treatment groups at some point in
time before the experiment starts, often by user id, but may or
may not appear during the actual experiment.
This missing pattern is far from random, therefore methods
that rely on the strong MCAR (missing completely
at random) assumption are not appropriate, and even the MAR
(missing at random) assumption is questionable, as it requires
the missing pattern to be random conditional on observed
covariates. One way to see that measurements are not missing at
random is to realize that infrequent users are more likely to have
missing values, so absence in a specific time window still
provides information about user behavior. In reality there
might also be other, unobserved factors causing a user to be
missing. Instead of throwing away data points
where a user appeared in only one period and was exposed to
only one of the two treatments, in practice we include an
additional indicator for whether or not the user appeared in
the study in the period.


Specifically, we use an additional indicator for the
presence/absence status of a user in each experimentation
period. For user $j$ in period $i$, let $I_{ij} = 1$ if user $j$ appears in
period $i$, and 0 otherwise. For each user $j$ in period $i$,
instead of one scalar metric value ($X_{ij}$), we augment it into
a vector $(I_{ij}, X_{ij})$. When $I_{ij} = 0$, i.e. the user is missing, we
define $X_{ij} = 0$. Under this simple augmentation, the metric
value $X_i$ for period $i$, taking the average over the non-missing
measurements, is the same as $\frac{\sum_k X_{ik}}{\sum_k I_{ik}}$. In this connection, to
obtain the MLE and its variance, we only need to estimate the
covariance matrices for each group across time periods, i.e.

$$\mathrm{Cov}(X_i, X_{i'}) = \mathrm{Cov}\left(\frac{\sum_k X_{ik}}{\sum_k I_{ik}}, \frac{\sum_{k'} X_{i'k'}}{\sum_{k'} I_{i'k'}}\right) = \mathrm{Cov}\left(\frac{\bar{X}_i}{\bar{I}_i}, \frac{\bar{X}_{i'}}{\bar{I}_{i'}}\right)$$

where the last equality is by dividing both numerator and
denominator by the same total number of users who have ever
appeared in the experiments, so that $\bar{X}_i$ and $\bar{I}_i$ are
averages over all such users. Thanks to the central limit
theorem, the vector $(\bar{I}_i, \bar{X}_i, \bar{I}_{i'}, \bar{X}_{i'})$ is also asymptotically
normal. Plugging in the observed sample means and covariance
matrix, $\mathrm{Cov}(X_i, X_{i'})$ can be trivially computed with
the delta method; see Deng et al. (2013, Appendix B) for
a similar example; also see Van der Vaart (2000) for a textbook
treatment of the delta method.
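
The augmentation and the delta-method covariance can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's big data implementation: the function names are hypothetical, a missing measurement is represented by `None`, and the covariance uses the first-order linearization of each ratio, where user $k$ contributes $(X_{ik} - r_i I_{ik}) / \bar{I}_i$ with $r_i = \bar{X}_i / \bar{I}_i$.

```python
import statistics


def augment(values):
    """Turn per-user values (None = user absent) into (indicators, zero-filled values)."""
    indicators = [0.0 if v is None else 1.0 for v in values]
    filled = [0.0 if v is None else float(v) for v in values]
    return indicators, filled


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def linearized(indicators, filled):
    """Per-user first-order contributions to the ratio sum(filled) / sum(indicators)."""
    ratio = sum(filled) / sum(indicators)
    mean_ind = statistics.fmean(indicators)
    return [(x - ratio * i) / mean_ind for x, i in zip(filled, indicators)]


def period_covariance(values_a, values_b):
    """Delta-method estimate of Cov(X_a, X_b) for two periods of the same users."""
    e_a = linearized(*augment(values_a))
    e_b = linearized(*augment(values_b))
    return covariance(e_a, e_b) / len(e_a)


week1 = [3.0, None, 5.0, 4.0]   # user 2 is absent in week 1
week2 = [2.0, 1.0, None, 4.0]   # user 3 is absent in week 2
print(period_covariance(week1, week1))  # variance of the week 1 metric
print(period_covariance(week1, week2))  # covariance between the two weeks
```

Users who are absent in a period still enter the computation through their zero indicator, which is how the absence pattern contributes to the covariance estimate instead of being discarded.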


4.2 Metrics Beyond Average 

Treatment groups are assigned to users, but not all metrics
are simple averages across users. We can define a metric as
a function of other metrics. One important family of metrics
is page-level metrics such as click through rate. Page-level
metrics use the number of page-views as their denominator.
At first glance this might look like just a simple average.
However, since treatments are assigned to users (the independent
unit), page-views are not independent. Treating the metric
as a simple average over page-level measurements therefore
needs extra care. A better approach is to see it as a ratio
of two user-level metrics: clicks per user and page-views per
user.

$$\frac{\sum_i \text{Clicks}_i}{\sum_i \text{Pages}_i} = \frac{\sum_i \text{Clicks}_i / \sum_i I_i}{\sum_i \text{Pages}_i / \sum_i I_i},$$

where the sums run over users $i$ and $\sum_i I_i$ is the count of
users who appeared. The same delta method mentioned in
Section 4.1 naturally extends here, with a slightly more
complicated formula. Since the delta method applies in general
to any continuous function, we can handle any metric that is
defined as a continuous function of other metrics.
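
For a ratio metric such as click through rate, the delta method gives a closed-form variance from user-level clicks and page-views. The sketch below is illustrative only; `ratio_mean_and_variance` is a hypothetical name and the four-user data set is made up.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def ratio_mean_and_variance(numer, denom):
    """Delta-method estimate of mean(numer) / mean(denom) and its variance.

    numer and denom are per-user totals, e.g. clicks and page-views, so the
    user remains the independent unit.
    """
    n = len(numer)
    mean_d = statistics.fmean(denom)
    ratio = statistics.fmean(numer) / mean_d
    variance = (
        statistics.variance(numer)
        - 2.0 * ratio * covariance(numer, denom)
        + ratio**2 * statistics.variance(denom)
    ) / (n * mean_d**2)
    return ratio, variance


clicks = [1.0, 0.0, 2.0, 1.0]
pages = [2.0, 1.0, 4.0, 3.0]
ctr, ctr_var = ratio_mean_and_variance(clicks, pages)
print(ctr, ctr_var)
```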


5. FLEXIBLE AND SCALABLE REPEATED MEASURES ANALYSIS VIA FORME

5.1 Review of Existing Methods

It is common to analyze data from repeated measures designs
with the repeated measures ANOVA model and the F-test,
under certain assumptions, such as normality, sphericity
(homogeneity of variances in differences between each pair of
within-subject values), equal time points between subjects,
and no missing data. Such assumptions in general do not
hold for large-scale online experiments, where the assignment
of users into different treatment groups may not be
completely balanced.


A generally more applicable method is to analyze the data
using the linear mixed effect model, for which complete balance
is not necessary (Bates et al. 2012a). In particular, a
linear mixed effect model treats each measurement of a subject
as a data point, and models the measurement as

$$Y = \theta + \alpha X + \beta Z + \epsilon$$

Here $\theta$ is the global mean and $\alpha$ stands for the vector of all
deterministic fixed effects, while $\beta$ is the vector of all random
effects and $\epsilon$ is noise. $X$ and $Z$ are covariates in the model.
In our case they are indicators of treatment assignment,
period of the measurement, user id, and any other covariates.
As an example, one possible model for repeated measures
using lme4’s formula syntax (Bates et al. 2012a,b) is

Y ~ 1 + IsTreatment + Period + (1 | UserID),

where the only difference between this model and the usual linear
model behind the two sample test is the extra random effect
(clustered by UserID) to model the user “baseline”. More
complicated models exist to further model interaction and
joint random effects.
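
To make the role of the `(1 | UserID)` term concrete, the following simulation draws a shared per-user baseline and independent noise for two periods, then checks that the correlation between periods is close to the share of variance due to the baseline. It is a simplified sketch with no treatment or period effect; the function names, the standard deviations and the seed are all illustrative assumptions.

```python
import random
import statistics


def simulate_two_periods(n_users, sd_user, sd_noise, seed=0):
    """Simulate one metric value per user in each of two periods."""
    rng = random.Random(seed)
    period1, period2 = [], []
    for _ in range(n_users):
        # Shared per-user random effect, i.e. the user "baseline".
        baseline = rng.gauss(0.0, sd_user)
        period1.append(baseline + rng.gauss(0.0, sd_noise))
        period2.append(baseline + rng.gauss(0.0, sd_noise))
    return period1, period2


def correlation(xs, ys):
    """Pearson correlation of two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


p1, p2 = simulate_two_periods(20000, sd_user=2.0, sd_noise=1.0)
expected = 2.0**2 / (2.0**2 + 1.0**2)  # share of variance due to the baseline
print(correlation(p1, p2), expected)
```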


Random effects make modeling within-subject variability possible.
In repeated measures data, users might appear in multiple periods,
represented as multiple rows in the dataset. As a result, rows of
the dataset are not independent; their dependencies are clustered
by user. To see this, note that each user’s “baseline” measurement
is captured as a random effect. The same user in different periods
shares the same “baseline” random effect, which results in
dependency. The mixed effect model takes advantage of this: it
estimates the variance of the random effect while reducing the
variance of the average treatment effect estimate. In the case of
the crossover design, the model can take advantage of the positive
correlation between the two periods of the same user, which
improves accuracy in the estimation of the treatment effect,
similar to the illustration we derived in Section 2.4. The
treatment effect can be modeled as either a fixed effect or a
random effect (see footnote 1). If our interest is the average
treatment effect, we can model it as a fixed effect. Note that
modeling the treatment effect as a fixed effect does not mean we
need to assume it is fixed, which in general it is not, since
different subjects react to the treatment differently. Rather, the
focus here is the mean of the random treatment effect, not its
variance. One can still fit the model with a random treatment
effect and the results generally agree, though the fixed effect is
believed to be more robust against model assumptions; see
Wooldridge (2012).
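
To make the dependency between rows of the same user concrete, the following minimal sketch simulates two measurements per user that share a random “baseline” and checks that their correlation is close to the ratio of baseline variance to total variance. The function names, the normal distributions and the parameter values are illustrative assumptions only.

```python
import random


def simulate_two_periods(n_users, sigma_u, sigma, seed=0):
    """Draw (period 1, period 2) measurements sharing a per-user baseline."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_users):
        u = rng.gauss(0.0, sigma_u)  # user "baseline" random effect
        rows.append((u + rng.gauss(0.0, sigma), u + rng.gauss(0.0, sigma)))
    return rows


def correlation(pairs):
    """Pearson correlation between the two coordinates of each pair."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs)
    v1 = sum((a - m1) ** 2 for a, _ in pairs)
    v2 = sum((b - m2) ** 2 for _, b in pairs)
    return cov / (v1 * v2) ** 0.5


pairs = simulate_two_periods(20000, sigma_u=2.0, sigma=2.0)
rho = correlation(pairs)  # should be near 2^2 / (2^2 + 2^2) = 0.5
```

The larger the baseline variance relative to the noise, the stronger this between-period correlation, and the more a within-user comparison can gain over a between-user comparison.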

We point out two issues with using the traditional mixed effect
model, and claim that FORME is a better alternative in terms of
flexibility and scalability.


First, the linear mixed effect model (and also the generalized
linear mixed effect model) is a family of parametric models, and
relies on full knowledge of the likelihood function to perform
parameter fitting. This means the model needs to rely on
distributional assumptions such as normality. In particular, all
random effects are typically modeled as normally distributed or
jointly normally distributed, and the noise $\epsilon$ needs to be
either i.i.d. normal or the modeler needs to provide a known
covariance matrix. These assumptions are indispensable in the
theory and pivotal in the fitting of the model. For our
application in online A/B testing, many of these assumptions are
inappropriate. To name a few: for a metric like revenue per user,
it is inappropriate to model the user “baseline” revenue per week
as normally distributed due to its large skewness. Also, it is
hard to justify that the noise term $\epsilon$ is truly
independent of the other random effects. A heavier user might have
a bigger “baseline” revenue, and also bigger noise, and a bigger
(or in some cases smaller) treatment effect. The model also
assumes data are missing at random. Modelers of the linear mixed
effect model will need to modify the model by making random
effects jointly random, or by including more interaction terms.
However, the more complicated the model, the more questions on
model assumptions will arise. We show in Section 6, through a
simulation study, that a linear mixed effect model fitted with the
R package lme4 (Bates et al. 2012a) could result in biased
estimation of the average treatment effect when there is
correlation between the missing data pattern and the user random
effect.


Second, fitting a mixed effect model can be expensive. Available
packages in SAS or R are based on fitting MLE or REML (restricted
maximum likelihood). In either case, much effort is spent
estimating the variance of the random effect(s), or the covariance
matrix if they are jointly random. The fitting algorithm takes the
full dataset, with each row representing a measurement. In online
A/B testing, where tens of millions of users are involved, this
dataset could be large. In model fitting, each iteration requires
some operations on this full dataset, making the efficiency of
model fitting a concern in big data scenarios. To the authors’
best knowledge, there is no literature on the topic of big data
implementation of the linear mixed effect model. In our experience
FORME is one to two orders of magnitude faster than lme4, with a
much smaller memory footprint, even without map-reduce type
parallelism.

1 When there are only two measurements per subject, as in the
crossover design, a model with both the treatment effect and the
user “baseline” as random effects is unidentifiable. But the model
can be fit if there are more measurements per subject.

In the remainder of this section, we explain why FORME is both
more scalable and more flexible than the linear mixed effect
model.


5.2 FORME is Scalable 

Instead of modeling at the level of each individual measurement,
FORME sees the problem from a higher level and takes advantage of
big data. By the central limit theorem, the metrics of interest in
each period for treatment and control are approximately jointly
normal. Using the same notation as in Section 2, this multivariate
normal random vector is denoted by $X_i, Y_i$, $i = 0, \dots, 2$,
with mean $\beta(\lambda)$ and a certain covariance matrix. These
metric values are correlated with each other via common user-level
random effects, which are modeled explicitly in the linear mixed
effect model but not in FORME. This is because when our interest
is only in the average treatment effect, the estimates of those
random effects are irrelevant. Instead, FORME sees the average
treatment effect $\delta$ as just one parameter in the mean vector
of the metric values, $\beta(\lambda)$. That is, when modeling
metric values directly using a multivariate normal distribution
with parameters in the mean vector, all the complexities involving
the structure of the random effects are buried in the covariance
matrix of the multivariate normal, and we are left with a simple
task: estimating the parameters $\lambda$ of this multivariate
normal.


FORME estimates $\lambda$ by MLE. Asymptotic statistics also
guarantees that the estimates are normally distributed, with
covariance matrix derived from the Fisher information (Van der
Vaart 2000). Note that the scale of this step is much smaller than
the MLE fitting of a typical linear mixed effect problem. FORME
only needs to fit a multivariate normal of small dimension,
typically smaller than 12 (6 in a crossover design: treatment and
control for each of the pre-experiment period, period 1 and
period 2).
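
To make the estimation step concrete, here is a worked sketch under two simplifying assumptions that are not stated in the text: the mean vector is linear in the parameters, i.e. it equals $D\lambda$ for a known design matrix $D$ with full column rank, and the covariance matrix $\Sigma$ of the observed metric vector $m$ is treated as known (in practice it would be estimated from the data). The symbols $D$, $m$ and $\lambda$ are notation introduced for this sketch only.

```latex
\begin{aligned}
m &\sim N\bigl(D\lambda,\ \Sigma\bigr), \\
\hat{\lambda}
  &= \arg\min_{\lambda}\ (m - D\lambda)^{\top} \Sigma^{-1} (m - D\lambda)
   = \bigl(D^{\top}\Sigma^{-1}D\bigr)^{-1} D^{\top}\Sigma^{-1} m, \\
\operatorname{Cov}(\hat{\lambda})
  &= I(\lambda)^{-1} = \bigl(D^{\top}\Sigma^{-1}D\bigr)^{-1}.
\end{aligned}
```

Under these assumptions the matrices involved have the dimension of the metric vector, so the cost of this step does not grow with the number of users.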


The main computational burden is therefore in the estimation of
the covariance matrix. Fortunately, this step only involves
estimating pairwise covariances between metric values, all of
which can be computed by map-reduce in one pass over the data. To
handle missing data and metrics of general form (as a smooth
function of other metrics), the delta method can be employed
(Section 4). Applying the delta method only involves a slightly
more complicated covariance matrix, so we need to estimate more
covariance pairs in the same map-reduce pass, which adds
negligible complexity.
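
The one-pass, map-reduce friendly covariance computation can be sketched as follows. Each data shard accumulates a few sums, and partial results are merged by addition. The class name `PairStats` and its methods are hypothetical, and this sums-of-products form is the simplest one to merge rather than the most numerically robust.

```python
from dataclasses import dataclass


@dataclass
class PairStats:
    """Mergeable sufficient statistics for the covariance of two metrics."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxy: float = 0.0

    def add(self, x, y):
        # "Map" side: fold one user's pair of metric values into the sums.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxy += x * y

    def merge(self, other):
        # "Reduce" side: combine two partial results by adding their sums.
        return PairStats(self.n + other.n, self.sx + other.sx,
                         self.sy + other.sy, self.sxy + other.sxy)

    def covariance(self):
        # Unbiased sample covariance recovered from the accumulated sums.
        return (self.sxy - self.sx * self.sy / self.n) / (self.n - 1)
```

For example, two shards can each build a `PairStats` with `add`, and `shard_a.merge(shard_b).covariance()` gives the same value as a single pass over the combined data.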


5.3 FORME is Flexible 

FORME is not only scalable but also more flexible. Because FORME
doesn’t explicitly model random effects as linear mixed effect
models do, FORME makes no distributional assumptions on random
effects or noise $\epsilon$. FORME also makes no assumption on the
missing data pattern. FORME needs only one critical assumption,
namely that the central limit theorem is applicable, which is
rarely violated in online A/B testing, since traffic size is large
enough even for the most highly skewed metrics such as Revenue
(Kohavi et al. 2014). Specifically, FORME can be applied to all
these cases:

1. Data can have an arbitrary missing pattern. In other
words, no missing-at-random assumption is needed.

2. Treatment effect is random.

3. Treatment effect and user random effect (baseline) are
not independent.

4. Noises $\epsilon$ are not i.i.d.

5. Noise and random effects are not independent.

6. Interactions (e.g. treatment and control have different
user random effect distributions).

To close this section, we remark that the flexibility of FORME
really comes from its simplicity compared to the linear mixed
effect model. We believe FORME is also easier for practitioners to
understand. The cost of making fewer assumptions is that, when the
mixed effect model assumptions hold, the FORME estimate could
possess larger variance than the mixed effect model estimate. Next
we explore this through a simulation study.

6. RESULTS 

6.1 Simulation from Known Distributions 

We compare the variances that FORME produces to those of the
traditional linear mixed model under various simulation
assumptions. As an illustration we used the crossover design. We
simulate a total of 2N users, where N = 10000, and randomly split
them into two treatment groups. Measurements are generated as

$$X_{ij} = \mu + \delta_{ij} + u_i + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2),$$

where $i$ is the index for user and $j$ for time period,
$\epsilon_{ij}$ represents random noise and $u_i$ represents the
random user “baseline” effect. $\delta_{ij}$ is the treatment
effect for user $i$ in period $j$ (0 if not in treatment). In this
model, the between-period correlation is then
$\sigma_u^2 / (\sigma_u^2 + \sigma^2)$. If user $i$ is in
treatment for time period $j$,
$\delta_{ij} \sim N(\delta, \sigma_\delta^2) \times p_i$, where
$\delta \times E(p_i)$ is the ground truth average treatment
effect size, and $p_i$ is a continuous value between 0 and 1 that
represents the user’s activity level. We designed $p_i$ to be
correlated with $u_i$. This way we allow the treatment effect to
vary by how frequently a user visits the site. Finally, we allow
$X_{ij}$ to be missing with probability $\min(90\%, 1 - p_i)$.
This is intuitive since a less active user would be missing more
often. Note that in this simulation study we know exactly what the
true average treatment effect is. We simulated this process
K = 10000 times so that we have a good estimate of the ground
truth variance of the treatment effect estimated by FORME and by
the mixed effect model (lme4). We want to learn the following for
both FORME and lme4 from this simulation study: 1) is the estimate
unbiased, and 2) is the variance estimation correct. If both
methods are unbiased, then we want to know which one has smaller
variance. Without loss of generality, we used $\mu = 0$,
$\sigma = 4$, $\sigma_u = 2$ or 4, $\delta = 10$, and
$\sigma_\delta = 0.1\sqrt{12}$. We chose the following 5
simulation conditions:

1. Normal noise, no treatment effect, normal user random
effect $u_i \sim N(0, \sigma_u^2)$

2. Normal noise, no treatment effect, Poisson user random
effect $u_i \sim \mathrm{Poisson}(\sigma_u^2)$

3. Normal noise, with random treatment effect that is
correlated with user random effect: $N(\delta, \sigma_\delta^2) \times p_i$

4. Noise is correlated with user activity level: $\sigma = 2 \times p_i$

5. Noise is correlated with user activity level: $\sigma = 4 \times p_i$

All conditions have roughly 50% of users missing in each period.

First of all, in condition 3, when there is a random treatment
effect, we found that lme4 consistently gave biased estimates:
when the ground truth effect is 6.6, FORME estimates are very
close to the ground truth while lme4 always gave biased estimates
around 7.2. This is because lme4 relies on the assumption of
missing at random, which is violated here since the random effect
is negatively correlated with the chance of missing. We believe
this is a fundamental issue with the mixed effect model, as the
missing data pattern is often correlated with some underlying user
characteristic that is in turn correlated with the user’s response
to treatment. One might argue that the mixed effect model can be
fixed by adding more interaction terms. However, in practice more
complex models are often not identifiable (more parameters than
data points) and they only make more assumptions. We also noted
that, except for condition 3, lme4 provided unbiased estimates for
conditions 2, 4 and 5, where some assumptions of the mixed effect
model are violated. We believe the central limit theorem also
helped lme4 to stay unbiased in these cases. But the bias in
condition 3 seems to be more fundamental. We leave a more thorough
study of the bias of lme4 under violation of different assumptions
to future work.

We also compared variance in LME and FORME under the crossover
model in Figure 1. Both FORME and lme4 provided very good
estimation of variance. As expected, FORME pays a price for being
flexible and almost “model free”: the variance of the lme4
estimates is generally smaller. The variance gap is bigger when
the missing rate is higher and the between-period correlation is
higher. Although not shown, in the conditions where either there
is no missing data or the correlation is 0, FORME and lme4
estimates have the same variance. Although the lme4 estimate has
smaller variance, its potential bias is a show-stopper, since for
treatment effect estimation a low variance estimate is not useful
if it is biased.

Figure 1: Effect Variance in LME and FORME
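
The data-generating process of this simulation can be sketched as below for a condition-3 style replicate. Several pieces are assumptions made only for illustration: the text does not say how the activity level is tied to the user effect, so a logistic link is used here; the default `sigma_delta` is a placeholder value; and the missing probability is capped at 90%, i.e. min(0.9, 1 - p), which is our reading of the setup. The function name `simulate_crossover` is hypothetical.

```python
import math
import random


def simulate_crossover(n_per_group, mu=0.0, sigma=4.0, sigma_u=2.0,
                       delta=10.0, sigma_delta=0.5, seed=0):
    """One replicate of crossover data; None marks a missing measurement."""
    rng = random.Random(seed)
    rows = []
    for group in ("TC", "CT"):  # order of treatment (T) and control (C)
        for _ in range(n_per_group):
            u = rng.gauss(0.0, sigma_u)  # user "baseline" effect
            # ASSUMPTION: activity level in (0, 1), increasing in u.
            p = 1.0 / (1.0 + math.exp(-u / sigma_u))
            values = []
            for arm in group:
                # Random treatment effect scaled by activity; 0 in control.
                effect = rng.gauss(delta, sigma_delta) * p if arm == "T" else 0.0
                x = mu + effect + u + rng.gauss(0.0, sigma)
                # Less active users are missing more often (capped at 90%).
                missing = rng.random() < min(0.9, 1.0 - p)
                values.append(None if missing else x)
            rows.append((group, p, values[0], values[1]))
    return rows
```

With this particular link the missing rate is close to 50% per period, and heavier users both respond more to treatment and are observed more often, which is the kind of dependence between missingness and user effect that the comparison is designed to probe.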


6.2 Simulation from Empirical Data 

Next we randomly sampled from our in-house data a small subset of
N = 1250 users, randomly split the users into equal-sized subsets,
and applied various designs. We then simulated K = 10000 bootstrap
samples (with replacement) from this dataset, fit FORME, and
report the estimated MLEs. The variance based on these MLEs is
then compared to the variance estimated from the Fisher
information using the full dataset. Figure 2 shows that the two
agree well. Note the cumulative effect had a different effect size
from the rest of the designs. For this particular metric, using
CUPED results in roughly a 50% reduction in variance. The
crossover design shows a reduction of around 50% compared to the
parallel design.
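
The bootstrap check described here can be sketched on a much simpler estimator. The example below compares the bootstrap variance of a difference in means with its large-sample analytic variance; it is a stand-in for the FORME MLE and its Fisher information variance, not a reproduction of them, and the toy data and function names are assumptions.

```python
import random
from statistics import fmean, variance


def analytic_variance(treat, ctrl):
    """Large-sample variance of the difference in means."""
    return variance(treat) / len(treat) + variance(ctrl) / len(ctrl)


def bootstrap_variance(treat, ctrl, n_boot=1000, seed=0):
    """Variance of the difference in means across bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        t = rng.choices(treat, k=len(treat))  # resample users with replacement
        c = rng.choices(ctrl, k=len(ctrl))
        estimates.append(fmean(t) - fmean(c))
    return variance(estimates)


rng = random.Random(1)
treat = [rng.expovariate(1.0) + 0.1 for _ in range(400)]  # skewed toy metric
ctrl = [rng.expovariate(1.0) for _ in range(400)]
ratio = bootstrap_variance(treat, ctrl) / analytic_variance(treat, ctrl)
```

A ratio close to 1 indicates that the analytic variance is a good approximation for that estimator and sample size.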



Figure 2: Effect Variance from Fisher’s Information
and Bootstrap method

[Plot: % samples needed compared to crossover design, by Design and Type]

Figure 3: Percent samples needed to achieve the
same sensitivity for four metrics. Baseline is the
crossover design.

6.3 Real Experiments 

Finally, we report results from three typical metrics in one of
our real experiments. Here we used percent change as the effect
size. This way, weekly effect size is comparable to cumulative
effect size across two weeks. The variance of the effect size
therefore indicates the sample size needed to achieve the same
sensitivity. Figure 3 displays the percent of samples needed to
achieve the same sensitivity for three metrics using various
models, with the crossover design as baseline. The crossover
design therefore has a value of 100. All models included CUPED
since pre-experiment data always exists and is free. The crossover
design consistently needed the fewest samples. Next, the
re-randomized design had values between the crossover and the
parallel design. The cumulative design follows. When the
re-randomized model includes a leftover effect, the samples needed
can be larger than for the cumulative design for metric 2. Note
that compared to the previous benchmark, the cumulative design,
the crossover design can save up to 2/3 of the traffic for metric
3, while for the others the traffic savings are in the 30-40%
range. This is due to inherent differences in week-to-week
correlation across metrics. Note the drastic reduction in variance
for such metrics means the same feature can be tested with only
1/3 of the original traffic!

7. PRACTICAL CONSIDERATIONS 

At the design stage, we face a few choices under the same
framework of repeated measures design. Experimenters should use
domain knowledge and past experiments to inform the design. This
is more an art than a pure science. Here we give guidelines
according to our own experience.

7.1 Recommended Work Flow 

Due to the flexibility of a two-stage setup in repeated measures
design, we can use the information gathered in the first stage to
inform procedures in the next stage. We recommend using the
crossover design for validation stage experiments, for which we
have already gathered exploratory directional data. If the first
stage already results in statistical significance in the KPI, we
may choose to terminate the experiment early. However, in practice
we generally recommend running the experiment long enough to gain
enough power not only for the KPI, but also for other metrics
designed to monitor data quality and serve as guardrails against
unexpected changes.

Otherwise, in running the second stage, we can use domain
knowledge about carry over effect. If historical experiments on
similar feature iterations indicate a potential carry over effect,
we recommend running a complete 4-group crossover experiment, so
we can directly estimate the carry over effect. Otherwise, we
recommend using the 2-group crossover design to achieve the
maximum power for the KPI. If we are not sure, it is still
possible to leave a few days’ “wash-out” period after completing
the first stage, and see if any carry over effect can be observed.

• No swapping: When it is critically important to ensure a
consistent user experience, such as when changing the entire
layout of a site, it may not be desirable to show users the
new site for a week and then swap them back to the old site.
The experience may be too jarring to users and hurt the brand.
In such cases, we do not recommend re-assigning treatment
variants halfway through.

• Crossover: Relatively small changes that are less directly
noticeable are better candidates for treatment swapping. If
similar experiments from the past, or earlier exploration
data, do not indicate the presence of carry over effect, the
crossover design can be employed.

• Re-randomized: If we suspect the presence of carry over
effect, the re-randomized design enables us to measure it
directly and should be used here.

• Wash-out and decide: If we have little information to judge
carry over effect, we can run the first week of the
experiment, and then leave a few days as a “wash-out” period.
The next stage is data driven. Using such data we can estimate
the carry over effect explicitly.

  - If there is no significant carry over effect, proceed
    as in the crossover design.

  - Otherwise, proceed as in the re-randomized design.
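
The guidelines above can be summarized as a small decision helper. This is only a sketch of the recommendations as we read them; the function name and its two inputs are hypothetical simplifications of what in practice is a judgment call.

```python
def choose_design(highly_visible_change, carryover_evidence):
    """Map the design guidelines to a design name.

    carryover_evidence is True or False when past experiments or exploratory
    data speak to carry over effect, and None when there is too little
    information to judge.
    """
    if highly_visible_change:
        return "no swapping"
    if carryover_evidence is None:
        return "wash-out and decide"
    return "re-randomized" if carryover_evidence else "crossover"
```

For instance, `choose_design(False, None)` returns `"wash-out and decide"`, after which the observed wash-out data would determine whether to continue as a crossover or a re-randomized experiment.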

Once experiment data have been collected, they can be analyzed
with the following work flow to achieve the most power.

• No swapping:

  - Test equivalence of treatment effect across time.

  - If they are equivalent, report treatment effect in the
    “per time unit” metric values by analyzing with the
    parallel model, including pre-experiment data.

  - Otherwise, analyze only cumulative effects, including
    pre-experiment data. Note this is CUPED.

• Crossover design:

  - Test equivalence of treatment effect across time.

  - If they are equivalent, report treatment effect in the
    “per time unit” metric values by analyzing with the
    crossover model, including pre-experiment data.

  - If, however, an unexpected significant difference is
    found, there are several choices:

    * Report the two treatment effects separately.

    * To understand the difference properly, another phase
      of the experiment can be added, using the
      re-randomized design. With a total of three weeks’
      data, we can see whether the treatment effect
      difference is due to a true week-to-week difference,
      and study its trend, or due to carry over effect.

• Re-randomized design:

  - Test equivalence of treatment effect across time and
    presence of carry over effect.

  - Reduce the model if any of the effects are not
    statistically significant, and report treatment effect.
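
One simple way to carry out the "test equivalence of treatment effect across time" step is a two-sided z-test on the difference between the two period-level effect estimates. The sketch below assumes the estimates are approximately normal and that their variances and covariance are available; the function name is hypothetical, and a failure to reject is of course weaker evidence than a formal equivalence test.

```python
from math import sqrt
from statistics import NormalDist


def effects_differ(d1, d2, var1, var2, cov12=0.0, alpha=0.05):
    """Two-sided z-test of H0: period-1 effect equals period-2 effect."""
    z = (d1 - d2) / sqrt(var1 + var2 - 2.0 * cov12)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha
```

If the third returned value is `False`, the per-time-unit effect can be reported from the reduced model; otherwise the two period effects would be reported separately or investigated further.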

This carries the subtle difference between reporting a treatment
effect over the entire duration of the experiment and reporting it
per time unit (a week here). We argue that as long as weekly
treatment effects are stable over time, reporting the weekly
effect is intuitive, easy to understand, and easy to compare
across different experiments. In real life, various things can
happen during an experiment, and we may end up with an experiment
that ran only in partial weeks. In these cases, reporting the
treatment effect over the entire duration is better than throwing
away data or ignoring day-of-week differences.


7.2 Sample Size Considerations 

While direct estimation of sample size is difficult in the linear
mixed model, in practice there is an easy workaround. In the
traditional design, using the CLT with a simple two-sided test for
$H_0: \delta = 0$, sample sizes can be easily calculated:

$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2 / \mathrm{Var}(\delta)},$$

where $\alpha$ is the allowed false positive rate, usually 0.05,
and $1 - \beta$ is the desired power, usually 80% to 90%.


From historical data we can record the amount of variance reduced
for each metric. The magnitude is determined by the inherent
variance of the metric and the correlation across time periods,
both of which are observed to be fairly stable across many
experiments. Suppose the variance for metric X in a crossover
experiment is k% of that in the conventional t-test. If N subjects
are required to detect a change of $\delta$% in the t-test with,
say, 80% power, then k%N is the reduced sample size needed to
achieve the same power.
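
The sample size calculation and the k% scaling can be sketched as follows. Here the variance in the formula is read as a per-unit variance of the effect, which is our interpretation; the function and parameter names are hypothetical.

```python
from statistics import NormalDist


def required_sample_size(delta, unit_variance, alpha=0.05, power=0.8):
    """n = (z_{1-alpha/2} + z_{power})^2 * unit_variance / delta^2."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2 * unit_variance / delta ** 2


def crossover_sample_size(n_ttest, k_percent):
    """Scale the t-test sample size by the observed variance ratio k%."""
    return n_ttest * k_percent / 100.0
```

For example, if historical data suggest the crossover variance is 40% of the t-test variance for a metric, `crossover_sample_size(n, 40)` gives the traffic needed for the same power, where `n` comes from `required_sample_size`.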


8. DISCUSSIONS AND FUTURE WORK 


8.1 Extending to more frequent swaps 

The crossover design achieves sensitivity by exposing users to
both treatment variants in sequence, swapping the treatment
assignment once during the experiment. Each subject serves as his
or her own control, which lets the design account for
within-subject variance. A natural extension of the idea is to
swap treatment groups more than once. Essentially, this changes to
a more granular randomization unit, from users to page views.
Exploratory work shows this indeed achieves further variance
reduction.

However, this also raises concerns about inconsistent user
experience, diminished treatment effect size, stronger learning
effect, and the lack of a longer term measure. Despite these
concerns, it remains a valuable option in early stage experiments
to quickly select promising features for further iteration.


8.2 Limitations and concerns 

Due to user behavior differences between weekdays and weekends, we
usually recommend running each phase of the crossover design for
at least a full week. A crossover experiment then requires two
complete weeks to gather data, which hinders agility. Another
limitation is that for very highly visible features, like changes
to prominent UI elements, such swapping may not be desirable since
it may confuse the users. Finally, not all features can be tested
this way, as there might be a “learning” effect, where we can’t
have the users exposed to treatment “unlearn” the feature, while
keeping controls naive to the treatment. For example, suppose the
website provides new features and personalized content to
signed-in users to encourage a higher rate of signing in and
staying signed in. These users cannot then be forced to log out
into the control group. Ma et al. (2011) show one interesting case
where the crossover design can be extended to tackle this issue.


8.3 Further Improvements of FORME 

We’ve shown in Section 6.1 that the mixed effect model via lme4
provides a competing estimate of the average treatment effect that
could be biased when the missing data pattern correlates with the
user random effect, but often has smaller variance than FORME. We
noted that FORME has to pay some price to be more flexible and
robust, just as nonparametric models are usually less efficient
than their parametric counterparts. However, we suspect that the
efficiency of FORME can be further improved to match the
efficiency of the mixed effect model even under perfect mixed
effect model assumptions. Such an improvement would be very
desirable. But even without it, we believe the bias when data are
not missing at random is a big obstacle to adopting the mixed
effect model in online controlled experiments, and FORME should be
used instead.